Core visual object recognition refers to the rapid recognition of the identity of an object during
a single fixation. In humans and other primates, since this process happens very quickly, it is
thought to be the result of primarily feedforward neural processing in the ventral visual stream
[1]. Today, the best-performing computer vision systems are based on deep neural networks
(DNNs), which often achieve or even surpass human performance on object recognition tasks.
DNNs are currently championed as models of the neural processing underlying human object
recognition, based on an observed correspondence between patterns of activity in DNNs and
neural activity throughout the ventral visual stream [2]. However, the full multiplicity of
human vision is not well captured by a single accuracy value. More nuanced characterization
of visual behavior is needed. Constructive research in this area probes the sources not just of
the correspondences but also of the various divergences between human and machine vision. In
this issue of PLOS Biology, Jang and colleagues [3] explore one such divergence: noise
robustness.

[Figure 1 panels: Standard-trained DNN; Noise-trained DNN]

Fig 1. Noise-robust vision in humans and machines. Human visual object recognition is robust to various kinds of
noise. DNNs trained according to standard procedures are significantly less robust to noise. However, fine-tuning with
noisy images not only makes DNNs more robust; it also brings the behavior and activity of the network into greater
alignment with the human visual system. DNN, deep neural network; SSNR, signal-to-signal-plus-noise ratio.

https://doi.org/10.1371/journal.pbio.3001477.g001

By parametrically varying the amount of noise added to the image stimuli, Jang and colleagues
quantified performance on the object recognition task as a function of the
signal-to-signal-plus-noise ratio (SSNR). This allowed for the calculation of recognition
thresholds for both human viewers and the computer vision models. The behavior of both human and
machine viewers was also characterized with relevance maps. For the DNNs, these heatmaps
show the regions of the image that are most diagnostic of the network’s classification. Human
viewers were asked to “paint” the regions of the image that were most informative for their
decision. Examples of these relevance maps can be seen in Fig 1. Both recognition thresholds
and relevance maps were more human-like for noise-robust DNNs [3].
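
To make the threshold idea concrete, the following is a minimal sketch and not the procedure used by Jang and colleagues. It assumes that SSNR acts as a linear mixing weight between a clean image and a noise image, and that the recognition threshold is the SSNR at which accuracy first reaches a chosen criterion. The function names, the Gaussian noise parameters, the 0.5 criterion, and the accuracy values are all hypothetical and serve only as illustration.

```python
import numpy as np


def blend_with_noise(image, ssnr, rng):
    """Mix a [0, 1] image with pixel noise; ssnr=1 is the clean image, ssnr=0 is pure noise."""
    # Assumption: SSNR acts as a linear mixing weight between signal and noise.
    # The Gaussian noise parameters below are hypothetical placeholders.
    noise = rng.normal(loc=0.5, scale=0.2, size=image.shape)
    return np.clip(ssnr * image + (1.0 - ssnr) * noise, 0.0, 1.0)


def recognition_threshold(ssnr_levels, accuracies, criterion=0.5):
    """Lowest SSNR at which accuracy first reaches `criterion` (linear interpolation)."""
    order = np.argsort(ssnr_levels)
    ssnr = np.asarray(ssnr_levels, dtype=float)[order]
    acc = np.asarray(accuracies, dtype=float)[order]
    reached = np.nonzero(acc >= criterion)[0]
    if reached.size == 0:
        return float("nan")  # criterion never reached in the tested range
    i = reached[0]
    if i == 0:
        return float(ssnr[0])  # already at criterion at the lowest tested SSNR
    # Interpolate between the two SSNR levels that bracket the criterion.
    frac = (criterion - acc[i - 1]) / (acc[i] - acc[i - 1])
    return float(ssnr[i - 1] + frac * (ssnr[i] - ssnr[i - 1]))


# Illustrative usage with made-up accuracies (not data from the study).
rng = np.random.default_rng(0)
clean = rng.random((32, 32))
noisy = blend_with_noise(clean, ssnr=0.3, rng=rng)
levels = [0.1, 0.2, 0.3, 0.4]
made_up_accuracy = [0.0, 0.25, 0.75, 1.0]
threshold = recognition_threshold(levels, made_up_accuracy)
```

Under these assumptions, a lower threshold means that a viewer, human or machine, tolerates more noise before recognition fails. The same function can therefore be applied to both kinds of accuracy curve to compare them on a common scale.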

This research extends previous work showing that DNNs are severely affected by various
image corruptions and that the patterns of errors they make on such images do not mirror the
mistakes that humans make [5]. The SSNR threshold for the noise-trained DNN reported by
Jang and colleagues was slightly lower than that of the human viewers [3]. This is in line with
previous work, which found that data augmentation can lead to superhuman performance on
the specific image corruptions seen during training [6]. Rusak and colleagues [7] also
demonstrated that careful noise training can help DNNs generalize to unseen image corruptions.
The neuroimaging results presented by Jang and colleagues [3] provide novel evidence
that noise training brings the network’s internal information processing, not just its output,
into greater alignment with that of the human visual system.

What does this body of research say about the mechanisms underlying human noise robustness?
Jang and colleagues speculate that robustness to visual noise is acquired, at least in part,
through learning and experience, but the exact mechanisms by which visual experience
imparts robustness remain an open question. Jang and colleagues showed that their noise
training procedure allowed the network to generalize to natural weather conditions. The
hypothesis implied by these results is that simple exposure to a variety of viewing conditions
confers robustness that generalizes to conditions sharing statistical properties. However,
Geirhos and colleagues [5] conclude that humans and DNNs generalize in fundamentally different
ways. Their analysis, which included many different types of image transformations, found
transformations that appear very similar to human viewers but which did not enable
generalization in DNNs (networks trained on one do not generalize to the other). There are
likely additional inductive biases that influence human generalization to corrupted images
that are not captured in current DNN models and training algorithms. Since Jang and colleagues
only investigated 2 types of noise, their experiments are less well suited to address the
generalization question.

The comparison of human and machine perception is fraught with complications.
Funke and colleagues highlight how human bias can affect the interpretation of results,
the challenge of aligning the experimental conditions between human and machine viewers,
and the importance of distinguishing between necessary and sufficient conditions [8]. As this
is a new area of study, the methodological and interpretive norms are still being established.
For example, several authors have questioned the validity of attribution methods, including
the method used by Jang and colleagues to produce their relevance maps [9,10]. However, the
desiderata for methods used in machine learning research may be different from those for
comparison between human and machine perception. The fact that these relevance maps
showed some alignment with the human diagnostic regions may provide indirect evidence for
their bearing on human perception.

DNNs persist as the best model of human visual object recognition despite growing
documentation of the ways in which they deviate from human behavior and neural activity. These
deviations do not necessarily provide cause for the rejection of such models. Rather, they
provide useful signals for their refinement. For example, Xu and Vaziri-Pashkam recently
published a thorough comparison of 14 DNNs to activity throughout the human visual system.
They found that although early visual regions were well captured by the activity of early
network layers, significant variance was left unaccounted for in high-level visual areas [11].
The comparison in Jang and colleagues found that noise training increased the brain-model
correspondence, particularly in higher-level visual areas [3]. Together, these results point to
candidate refinements that could ultimately yield better models of the neural information
processing underlying human vision.